Read Graphical Data Analysis with R, Ch. 6, 7
Grading is based both on your graphs and verbal explanations. All graphs must have informative titles and follow all best practices as discussed in class.
Labels do not have to perfect but they have to be legible. Often it is helpful to shorten or abbreviate labels: this can be done before plotting or within the plot functions. Be sure to include all adjustments in your scripts.
We will use the penguins_raw dataset from the palmerpenguins package.
ggparcoord from the GGally package. Choose parameters to help identify trends. What do you observe?library(GGally)
library(palmerpenguins)
data <- Filter(is.numeric, penguins_raw)
ggparcoord(data, columns = 1 : 7, alphaLines =.3, scale ="uniminmax") + geom_vline(xintercept =1:7, color ="lightblue")
Sample number and culmen length, Body Mass (g) and Delta 15 N (o/oo), tend to have negative correlation. Flipper length is either high or low not much in the middle at .50. Average culmen legnth is low with some outliers having high culmen length.
#groupby Island
ggparcoord(penguins_raw, columns = c(2, 10:13, 15:16), groupColumn = 5, alphaLines =.3, scale ="uniminmax")
Penguins from Biscoe island tend to have higher Culum length and lower culum depth & Delta 15N compared with other Island. Penguins from Dream and Togersen always have similar characteristics.
pulitzer dataset in the fivethirtyeight package with brushing using the parcoords() function in the parcoords package.library(dplyr)
library(fivethirtyeight)
library(parcoords)
pulitzer %>% arrange(newspaper) %>%
parcoords(
rownames = F,
brushMode = "1D-axes",
reorderable = T,
queue = T)
New York Times, USA Today, Los Angeles Times, and Wall Street Journal. These journals have a high number of pulitzer prizes in all years compared to the other journals. They also have high daily circulation in both 2004 and 2013 than most other journals.
For this problem we will use the DoctorVisits data set in the AER package. (Note: for this package you need data("DoctorVisits") to load the data.)
visits to be the dependent variable. Create a new factor column, numvisits, based on visits, that contains three levels: “0”, “1”, and “2+”. (Tidyverse option: forcats::fct_collapse()). Why is this a reasonable choice?library('AER')
library(forcats)
data("DoctorVisits")
DoctorVisits$visits <- as.factor(DoctorVisits$visits)
DoctorVisits <- DoctorVisits %>%
mutate(numvisits = fct_collapse(visits,
'0' = c("0"),
'1' = c("1"),
'2+' = c("2", "3", "4", "5", "6", "7", "8", "9")
))
head(DoctorVisits)
## visits gender age income illness reduced health private freepoor freerepat
## 1 1 female 0.19 0.55 1 4 1 yes no no
## 2 1 female 0.19 0.45 1 2 1 yes no no
## 3 1 male 0.19 0.90 3 0 0 no no no
## 4 1 male 0.19 0.15 1 0 0 no no no
## 5 1 male 0.19 0.45 2 5 1 no no no
## 6 1 female 0.19 0.35 5 1 9 no no no
## nchronic lchronic numvisits
## 1 no no 1
## 2 no no 1
## 3 no no 1
## 4 no no 1
## 5 yes no 1
## 6 yes no 1
Because people who visit more than or equals to 2 means he/she is a re-visit patient so they can be allocated to one category. Those who visited once may just for some specific reason.
DoctorVisits. Based on the plot, which variables appear to have a strong association with numvisits? medium association? weak or no association? (Make your judgments relative to the other variables.)count <- DoctorVisits %>%
group_by(gender, private, freepoor, freerepat, nchronic, lchronic, numvisits) %>%
summarise(Freq = n())
head(count)
## # A tibble: 6 x 8
## # Groups: gender, private, freepoor, freerepat, nchronic, lchronic [2]
## gender private freepoor freerepat nchronic lchronic numvisits Freq
## <fct> <fct> <fct> <fct> <fct> <fct> <fct> <int>
## 1 male no no no no no 0 624
## 2 male no no no no no 1 51
## 3 male no no no no no 2+ 23
## 4 male no no no no yes 0 61
## 5 male no no no no yes 1 11
## 6 male no no no no yes 2+ 3
factor_summary <- DoctorVisits %>% group_by(gender, private, freepoor, freerepat, nchronic, lchronic, numvisits) %>% tally()
pairs(xtabs(n ~ ., factor_summary))
numvisits consistent with your categorization in part (b)? Explain briefly.X1 <- chisq.test(DoctorVisits$numvisits, DoctorVisits$age)
X2 <- chisq.test(DoctorVisits$numvisits, DoctorVisits$income)
X3 <- chisq.test(DoctorVisits$numvisits, DoctorVisits$illness)
X4 <- chisq.test(DoctorVisits$numvisits, DoctorVisits$reduced)
X5 <- chisq.test(DoctorVisits$numvisits, DoctorVisits$health)
X1
##
## Pearson's Chi-squared test
##
## data: DoctorVisits$numvisits and DoctorVisits$age
## X-squared = 154.76, df = 22, p-value < 2.2e-16
X2
##
## Pearson's Chi-squared test
##
## data: DoctorVisits$numvisits and DoctorVisits$income
## X-squared = 97.044, df = 26, p-value = 3.961e-10
X3
##
## Pearson's Chi-squared test
##
## data: DoctorVisits$numvisits and DoctorVisits$illness
## X-squared = 389.35, df = 10, p-value < 2.2e-16
X4
##
## Pearson's Chi-squared test
##
## data: DoctorVisits$numvisits and DoctorVisits$reduced
## X-squared = 965.2, df = 28, p-value < 2.2e-16
X5
##
## Pearson's Chi-squared test
##
## data: DoctorVisits$numvisits and DoctorVisits$health
## X-squared = 244.34, df = 24, p-value < 2.2e-16
gender, lchronic, private and numvisits in stages:gender and numvisitsvcd::mosaic(numvisits ~ gender,
direction = c("v", "h"), highlighting_fill = c("grey90", "cornflowerblue"), count)
gender, lchronic and numvisitsvcd::mosaic(numvisits ~ gender + lchronic,
direction = c("v", "v", "h"), highlighting_fill = c("red", "cornflowerblue"), count)
gender, lchronic, private and numvisitsvcd::mosaic(numvisits ~ gender + lchronic + private,
direction = c("v", "v", "v", "h"), highlighting_fill = c("grey90", "cornflowerblue"), count)
All cuts should be vertical except the last one. The last cut should be the dependent variable. Use appropriate fill colors. What patterns do you observe?
Gender, provate, and Ichronic may not be a good feature for split.
geom_alluvium() from the ggalluvial package to display the same variables as in the last graph of part (d). What new information / perspective does the alluvial plot provide, if any?library(ggalluvial)
ggplot(as.data.frame(count), aes(y = Freq, axis1 = lchronic, axis2 = private)) +
geom_alluvium(aes(fill = lchronic), width = 1/12)
Those who is Ichronic and those who not have totally different patterns.
data <- read.csv('https://raw.githubusercontent.com/nychealth/coronavirus-data/master/by-age.csv')
head(data)
## AGE_GROUP CASE_RATE HOSPITALIZED_RATE DEATH_RATE CASE_COUNT
## 1 0-17 575.53 40.42 0.87 9939
## 2 18-44 2878.40 294.95 22.23 96993
## 3 45-64 4190.32 931.69 210.21 86155
## 4 65-74 4113.90 1807.99 681.57 28761
## 5 75+ 4864.57 2916.95 1725.86 26608
## 6 Citywide 2964.07 695.22 229.70 248945
## HOSPITALIZED_COUNT DEATH_COUNT
## 1 698 15
## 2 9939 749
## 3 19156 4322
## 4 12640 4765
## 5 15955 9440
## 6 58390 19292
HH::likert(AGE_GROUP~., data, positive.order = TRUE,
main = 'NYC Covid-19 test result by age',
xlab = 'percent', ylab = 'age group')